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Abstract — In a recent paper, the authors proposed a new class 
of low-complexity iterative thresholding algorithms for recon- 
structing sparse signals from a small set of linear measurements 
ffl . The new algorithms are broadly referred to as AMP, for 
approximate message passing. This is the first of two conference 
papers describing the derivation of these algorithms, connection 
with the related literature, extensions of the original framework, 
and new empirical evidence. 

In particular, the present paper outlines the derivation of 
AMP from standard sum-product belief propagation, and its 
extension in several directions. We also discuss relations with 
formal calculations based on statistical mechanics methods. 

I. Introduction 

Let s be a vector in IR^. We observe n < N linear 
measurements of this vector through the matrix A, y = As a . 
The goal is to recover s from (y,A). Although the system 
of equations is underdetermined, the underlying signal can 
still be recovered exactly or approximately if it is 'simple' 
or 'structured' in an appropriate sense. A specific notion 
of 'simplicity' postulates that s is exactly or approximately 
sparse. 

The £i minimization, also known as the basis pursuit 
0, has attracted attention for its success in solving such 
underdetermined systems. It consists in solving the following 
optimization problem: 

minimize || s || i , subject to As = y . (1) 

The solution of this problem can be obtained through generic 
linear programming (LP) algorithms. While LP has polynomial 
complexity, standard LP solvers are too complex for use in 
large scale applications, such as magnetic resonance imaging 
and seismic data analysis. Low computational complexity of 
iterative thresholding algorithms has made them an appeal- 
ing choice for such applications. Many variations of these 
approaches have been proposed. The interested reader is 
referred to f3| for a survey and detailed comparison. The 
final conclusion of that paper is rather disappointing: optimally 
tuned iterative thresholding algorithms have a significantly 
worse sparsity-undersampling tradeoff than basis pursuit. 

Recently (1), we proposed an algorithm that appears to 
offer the best of both worlds: the low complexity of iterative 
thresholding algorithm, and the reconstruction power of the 
basis pursuit Qj. This algorithm is in fact an instance of 
a broader family of algorithms, that was called AMP, for 
approximate message passing, in |fl~). The goal of this paper is 



to justify AMP by applying sum-product belief propagation for 
a suitable joint distribution over the variables si, S2, ■ ■ ■ , sat. 

The paper is organized as follows: In Section HI1 we explain 
the notations used in this paper. We then derive the AMP 
algorithm associated the basis pursuit problem in Section [Till 
In Section IIV1 we consider the AMP for the basis pursuit 
denoising (BPDN) or Lasso problem. We will also generalize 
the algorithm to the Bayesian setting where the distribution 
of the elements of s a is known, in Section [V] Finally we will 
explain the connection with formal calculations based on non- 
rigorous statistical mechanics methods in Section [VTl 

Due to space limitations, proofs are omitted and can be 
found in a longer version of this paper J4). 

II. Notations 

The letters a, b, c, . . . denote indices in [n] = {1, . . . , n} 
and i,j, k, . . . represent indices in [N] = {1, . . . , N}. The 
a,i element of the matrix A will be indicated as A a i. The 
elements of the vectors y, s, x, and s a are indicated by y a , 
Si, Xi, and s Dl j respectively. 

The ratio 5 — n/N is a measure of indeterminacy of the 
system of equations. Whenever we refer to the large system 
limit we consider the case where N, n — > oo with 5 fixed. 
In this limit the typical entry of A should scale as 1/ y/n. In 
the concrete derivation, for the sake of simplicity we assume 
that A a i € {+l/y/n, —1/y/n}. This assumption is not crucial, 
and only simplifies the calculations. Although the algorithms 
are developed from the large system limit, in practice, they 
perform well even in the medium size problems with 'just' 
thousands of variables and hundreds of measurements |5). 

III. AMP for the Basis Pursuit 

In this section we consider the the basis pursuit problem as 
defined in Eq. ([TJ. The derivation of AMP proceeds in 4 steps: 

(1) Construct a joint distribution over (si, . . . , sn), parame- 
terized by (3 £ R + , associated with the problem of interest 
and write down the corresponding sum-product algorithm. 

(2) Show, by a central limit theorem argument, that for the 
large system limit, the sum-product messages can well be 
approximated by the families with two scalar parameters. 
Derive the update rules for these parameters. 

(3) Take the limit f3 — > oo and get the appropriate rules for 
basis pursuit problem. 

(4) Approximate the message passing rules for the large 
system limit. The resulting algorithm is AMP. 



A. Construction of the graphical model 

We consider the following joint probability distribution over 
the variables Si,S2, ■ ■ ■ Sjv 

N n 

M(ds) = 7]] exp W Sl l)Il^-(^U ■ ( 2 ) 



0=1 



Here frr^w^,^} denotes a Dirac distribution on the hyper- 
plane y a = (Ax) a . Products of such distributions associated 
with distinct hyperplanes yield a well defined measure. As we 
let (3 — > 00, the mass of /i concentrates around the solution 
of (Q~|). If the minimizer is unique and we have access to the 
marginals of n, we can therefore solve ([TJ. Belief propagation 
provides low-complexity heuristics for approximating such 
marginals. 

In order to introduce belief propagation, consider the factor 
graph G = (V,F,E) with variable nodes V = [N], factor 
nodes F = [n] and edges E = [N] x [n] — {(i,a) : 
i 6 [N], a G [n]}. Hence G is the complete bipartite graph 
with TV variable nodes and n factor nodes. It is clear that 
the joint distribution (O is structured according to this factor 
graph. Associated with the edges of this graph are the belief 
propagation messages }iev,aec and }iev,aec In 
the present case, messages are probability measures over the 
real line. The update rules for these densities are 



b^a 



,00 



(3) 



(4) 



where a superscript denotes the iteration number and the 
symbol = denotes identity between probability distributions 
up to a normalization constant 

B. Large system limit 

A key remark is that in the large system limit, the messages 
t^a^i ( • ) are approximately Gaussian densities with variances 
of order N, and the messages v\_^ a ( • ) are accurately approxi- 
mated by the product of a Gaussian and a Laplace density. We 
state this fact formally below. Recall that, given two measure 
Hi and H2 over R, their Kolmogorov distance is given by 
Hmi - M2II1C = sup aeR \fii(-oo,a] ~ fi 2 (-oo,a]\. 

The first Lemma is an estimate of the messages 

Lemma III.l. Let Xj_, a and (t|^ q //3) be, respectively, the 
mean and the variance of the distribution ut.„. Assume 
further J \sj\ 3 duj_ >a (sj) < Ct uniformly in N,n. Then there 
exists a constant C[ such that 

CI 




exp 



2f* 



■{Aa 



(5) 



'More precisely, given two non-negative functions p, q : Q — > R over the 
same space, we write p(s)=q(s) if there exists a positive constant a such 
that p(s) = a q(s) for every s £ Q. 



where the distribution parameters are given by 

z a^i — Da — ^ ] A a jXj_> a , T a ^ i — y ' A a jTj_> a . (6) 

Motivated by this lemma, we consider the computation of 
the means and the variances of the messages f*.ji(sj). It is 
convenient to introduce a family of densities 

Ms; x ' b) 55 T^b) exp { ~ /3|s| " l (s " x)2 } ■ (7) 

Also let Fp and Gp denote its mean and variance, i.e., 

Fp(x;b) =E /0( .. Xib) (Z), Gp(x;b) = Var^ ( . . X , 6) (Z) . (8) 

^From Eq. ©, we expect f to concentrate tightly. There- 
fore we assume that it is independent of the edge (i, a). 

Lemma III.2. Suppose that at iteration t, the messages from 
the factor nodes to the variable nodes are v t a ^ i — %^i> 
with defined as in Eq. @ with parameters Z^_^ and 

f*. Then at the next iteration we have 



f* 

a — >% 



i4t 1 a (si) = tit 1 a (s i ){l + 0(s*/n)}, 
The mean and the variances of these messages are given by 



b^a 



bjia 



C. Large (3 limit 



In the limit (3 — + oo, we can simplify the functions Fp 
and Gp. Consider the soft thresholding function r](x; b) = 
sign(x)(|a;| — £>)+. It is well known that this admits the 
alternative characterization 



,/(.<■://) = iii'-niin,.^;, <j |s| + 2^( s ~ x ? 



(9) 



In the (3 — > oo limit, the integral that defines F l a(a;;&) is 
dominated by the maximum value of the exponent, that corre- 
sponds to s* = 7y(x; b) and therefore F^(a;; 6) — > ry(a;; 6). The 
variance (and hence the function G^a^fr)) can be estimated 
by approximating the density fp(s;x,b) near s*. Two cases 
can occur. If 7^ 0, then a Gaussian approximation holds and 
Gp(x; b) = 6(l//3). On the other hand, if = 0, fp(s;x,b) 
can be approximated by a Laplace distribution, leading to 
Gp(x; b) = 9(l//3 2 ) (which is negligible). We summarize this 
discussion in the following. 

Lemma III.3. For bounded x, b, we have 
lim Fp(x; 0) = r](x; b) , lim (3Gp(x; j3) = br\(x\ b) . 

P — >oo P — >oo 



Lemmas |III.l|III.2l and 1111.31 suggest the following equiva- 
lent form for the message passing algorithm (for large (3): 

4t 1 a =v(J2 A >>i z i^? t ), do) 

z a^i = Va — 2_j AajX^ai (H) 

- m = ^E<E^^). (12) 

D. From message passing to AMP 

The updates in Eqs. (TTTT i. dT2b are easy to implement 
but nevertheless the overall algorithm is still rather complex 
because it requires to track 2nN messages. The goal of this 
section is to further simplify the update equations. In order 
to justify the approximation we assume that the messages 
can be approximated as x\^ a = x\ + Sx\_ ta + 0(1/ N), 

=zl + Szl^ + 0(1/N), With SxUaMa^i = 0(^') 

(here the 0( ■ ) errors are assumed uniform in the choice of the 
edge). We also consider a general message passing algorithms 
of the form 

rttl = m(Yl A bi4^i) , *2U< = y* - E < 13 ) 

with {rjt( ■ )}teiN a sequence of differendiable nonlinear func- 
tions with bounded derivatives. Notice that the algorithm 
derived at the end of the previous section, cf. Eqs. (fTTT i. 
Eqs. ( fT2l . is of this form, albeit with rjt non-differentiable 
at 2 points. But this does not change the result, as long as the 
nonlinear functions are Lipschitz continuous. In the interest of 
simplicity, we just discuss the differentiable model. 

Lemma III.4. Suppose that the asymptotic behavior described 
in the paragraph above holds for the message passing algo- 
rithm ( I73I ). Then x\ and z* satisfy the following equations 

X i +1 =Vt(j2 A ™ Z ° +xt i) + 

a 

4 =y a -^A aj x) + ^z t a - 1 (7 1 / t _ 1 (A*z t - 1 +x t - 1 ))+o N {l), 
j 

where the 0^(1) terms vanish as N,n — > oo. 

As a consequence, the resulting algorithm can be written in 
the vector notation as follows: 

x t+1 =r,{A*z t +x t ;f t ), (14) 

z* = y - Ax 1 + \z t ~ 1 {r 1 '{A*z t - 1 + x*" 1 ; f*" 1 )) , (15) 
o 

where ( • ) denotes the average of a vector. 
We also get the following recursion for f: 

f t = ^-(v'(A*z t - 1 + x t ;r t - 1 )). (16) 



E. Comments 

Threshold level. The derivation presented here yields a 
'parameter free' algorithm. The threshold level f* is updated 
by the recursion in Eq. ( fToT ). One could take the alternative 
point of view that f * is a parameter to be optimized. This point 
of view was adopted in Q], 13 . It is expected that the two 
points of view coincide in the large system limit, but it might 
be advantageous to consider a general sequence of thresholds. 

Mathematical derivation of AMP. We showed that in a 
specific limit (large systems, and large f3) the sum-product 
update rules can be significantly simplified to ( [Pil l. ( [TBI ). We 
should emphasize that our results concern just a single step of 
the iterative procedure. As such they do not prove that the sum- 
product messages are carefully tracked by Eqs. ( TT4"1 >. ( fTST ). In 
principle it could be that the error terms in our approximation, 
while negligible at each step, conjure up to become large after 
a finite number of iterations. We do not expect this to be the 
case, but it is nevertheless an open mathematical problem. 

IV. AMP for BPDN/Lasso 

Another popular reconstruction procedure in compressed 
sensing is the following problem 

minimize A||s||i + —\\y — As 1 1 1 . (17) 

The derivation of the corressponding AMP is similar to the 
one in the previous section. We therefore limit ourself to 
mentioning a few differences. 

As before we define a joint density distribution on the 
variables s = (si, . . . , sjy) 

N n „ 

Mds) = zHexp(-f3\\sA) J] exp { - |(y a - (As) a ) 2 } ds . 

i—l a— 1 

The mode of this distribution coincides with the solution of 
the problem ( fTTT i and the distribution concentrates on its mode 
as (3 — > oo. The sum-product algorithm is 

^ 1 a (^)=exp(-/3A| Sl |)n^(^), 

bjta 

"Ui(si)= /ex P { - ^(y a - (As) a ) 2 } n^«(*i) ■ 

j& 

Proceeding as above, we derive an asymptotically equivalent 
form of the belief propagation for N — > oo and (3 — > oo. We 
get the following algorithm in the vector notation: 

x l = r j (x t +A*z t ;X + j t ) ) (18) 

z t+1 = y-Ax t + \^M(x*- x + A*z t - 1 ), ) (19) 
o 

which generalize Eqs. (fl4l and dl~5T >. The threshold level is 
computed iteratively as follows 

j t+1 = ^^W(Az t + x t ; 1 t + X)). (20) 

Notice that the only deviation from the algorithm in the 
previous section is in the recursion for the threshold level. 



V. AMP FOR RECONSTRUCTION WITH PRIOR 
INFORMATION 

In the two cases we discussed so far, the distribution of the 
signal s was not known. This is a very natural and practical 
assumption. Nevertheless, it might be possible in specific sce- 
narios to estimate the input distribution. This extra information 
may be used to improve the recovery algorithms. Also, the 
case of known signal distribution provides a benchmark for 
the other approaches. In this section we define a very simple 
iterative theresholding algorithm for these situations. 

Let a = aixa2---xa7vbe the joint probability distribution 
of the variables si, S2, ■ ■ ■ , sn- It is then natural to consider 
the distribution 

1 n 8 N 

/*(<**) = - exp { - |(y„ - {As) a f} JJai(dsi) , 

a—l " i—1 

since p is the a posteriori distribution of s, when y = As + w 
is observed. Here, w is a noise vector with i.i.d. normal entries 
and is independent of s. The sum-product update rules are 

i>U(*)= J exp{-^(y a -(A S ) Q ) 2 }l]^a(d^)- 

Notice that the above update rules are well defined. At each 
iteration t, the message ^*^(dsi) is a probability measure on 
R, and the first equation gives its density with respect to an. 
The message ^^(sj) is instead a non-negative measurable 
function (equivalently, a density) given by the second equation. 
Clearly the case studied in the previous section corresponds 
to ai-^exp(-/3|s 4 |). 

In order to derive the simplified version of the message pass- 
ing algorithm, we introduce the following family of measures 
over 1R 

fi(ds;x,b) = — -!— — exp( - — (s - x) 2 } a 4 (ds) , (21) 

indexed by i £ [N], x £ R, b £ R+ (/3 is fixed here). The 
mean and the variance of this distribution define the functions 
(here Z~ /<(■;!, 6)) 

Fi(x;b) =% 4 (. ; x ,b)(Z), Gi(x-,b) = V&r fi{ .. x ^(Z) . (22) 

These functions have a natural estimation-theoretic interpreta- 
tion. Let Xi be a random variable with distribution 04, and 
assume that Yi = Xi + W% is observed with W% gaussian noise 
with variance b/(3. The above functions are -respectively- the 
conditional expectation and conditional variance of Xi, given 
that Yi = x: 

F t (x; b) = E(X t \Yi = x) , Gi{x; b) = Var(X i |Y = x) . 

The approach described in Section [HI] yields the following 
AMP (in vector notation) 

x * = F( x t + A*z t ;\ + ~/ t ), (23) 

z t+1 = y-Ax t + \z t (F l {x t - 1 +A*z t - 1 )). (24) 




Here, if x G M, N , F(x;b) £ R N is the vector F(x;b) = 
(F 1 (x i ;b),F 2 (x 2 ;b), . . . ,F N (x N - 1 b)). Analogously F'(x) = 
(F' 1 (xi] b), F 2 (x2 ;&),..., F' N (xjy; b)) (derivative being taken 
with respect to the first argument). Finally, the threshold level 
is computed iteratively as follows 

1 t+1 = l(G(Az t +x t ;j t + \)). (25) 


A. Comments 

The AMP algorithm described in this section is marginally 
more complex than the ones in the previous sections. The 
main difference is that the soft thresholding function 77 ( • ) 
is replaced with the conditional expectation F( • ). While the 
latter does not admit, in general, a closed form expression, it 
is not hard to construct accurate approximations that are easy 
to evaluate. 

VI. Related work 

In this section we would like to clarify the relation of the 
present approach with earlier results in the literature. Each of 
these lines of work evolved from different motivations, and 
there was so far little - if any - contact between them. 

A. Other message passing algorithms 

The use of message passing algorithms for compressed 
sensing problems was suggested before, see for instance J6). 
However such a proposal faced two major difficulties. 

(1) According to the standard prescription, messages used in 
the the sum-product algorithm should be probability measures 
over the real line R, cf. Eqs. (0, (0). This is impractical from 
a computational point of view. (A low complexity message- 
passing algorithm for a related problem was used in Q). 

(2) The factor graph on which the sum-product algorithm is 
run is the complete bipartite graph with N variable nodes, and 
n function nodes. In other words, unless the underlying matrix 
is sparse [8 |, the graphical model is very dense. This requires 
to update Nn messages per iteration, and each message update 
depend on N or n input messages. Again this is very expensive 
computationally. 

The previous pages show that problem (2) does not add 
to (1), but in fact solves it! Indeed, the high density of the 
graph leads to approximately Gaussian messages from factor 
nodes to variable nodes, via central limit theorem. Gaussian 
messages are in turn parametrized by two numbers: mean and 
variance. It turns out that is is sufficient to keep track only of 
the means, again because of the high density. 

Problem (2) is also solved by the high density nature of the 
graph, since all the messages departing from the same node 
of the graph are very similar with each other. 

One last key difficulty with the use of belief propagation in 
compressed sensing was 

(3) The use of belief propagation requires to define a prior on 
the vector s a . For most applications, no good prior is available. 

The solution of this problem lies in using a Laplace prior 
as in Eq. (O. A first justification of this choice lies in the fact 



that, as (3 — > oo, the resulting probability measure concentrates 
around its mode, that is the solution of the basis pursuit 
problem (Q~|). A deeper reason for this choice is that it is 
intimately related to the soft threshold non-linearity rj(x; 9), 
which is step-by-step optimal in a minimax sense HI, |[5). 

B. Historical background and statistical physics 

There is a well studied connection between statistical 
physics techniques and message passing algorithms (9|. In par- 
ticular, the sum-product algorithm corresponds to the Bethe- 
Peierls approximation in statistical physics, and its fixed points 
are stationary points of the Bethe free energy. In the context 
of spin glass theory, the Bethe-Peierls approximation is also 
referred to as the 'replica symmetric cavity method'. 

The Bethe-Peierls approximation postulates a set of non- 
linear equations on quantities that are equivalent to the sum- 
product messages, and which are in correspondence with local 
marginals. In the special cases of spin glasses on the complete 
graph (the celebrated Sherrington-Kirkpatrick model), these 
equations reduce to the so-called TAP equations, named after 
Thouless, Anderson and Palmer who first used them iflOl . 

The original TAP equations where a set of non-linear 
equations for local magnetizations (i.e. expectations of a single 
variable). Thouless, Anderson and Palmer first recognized that 
naive mean field is not accurate enough in the spin glass 
model, and corrected it by adding the so called Onsager reac- 
tion term that is analogous to the last term in Eq. (fT~5T >. More 
than 30 years after the original paper, a complete mathematical 
justification of the TAP equations remains an open problem 
in spin glass theory, although important partial results exist 
ifTTI . While the connection between belief propagation and 
Bethe-Peierls approximation stimulated a considerable amount 
of research lfl2l . the algorithmic uses of TAP equations have 
received only sparse attention. Remarkable exceptions include 

m, ma, ma. 

C. State evolution and replica calculations 

In the context of coding theory, message passing algorithms 
are analyzed through density evolution [16]. The common 
justification for density evolution is that the underlying graph 
is random and sparce, and hence converges locally to a tree in 
the large system limit. In the case of trees density evolution 
is exact, hence it is asymptotically exact for sparse random 
graphs. 

State evolution is the analog of density evolution in the case 
of dense graphs. For definitions and results on state evolution 
we refer to the (TJ, 0. The success of state evolution cannot 
be ascribed to the locally tree-like structure of the graph, and 
calls for new mathematical ideas. 

The fixed points of state evolution describe the output of the 
corresponding AMP, when the latter is run for a sufficiently 
large number of iterations (independent of the dimensions 
n,N). It is well known, within statistical mechanics |9|, that 
the fixed point equations do indeed coincide with the equa- 
tions obtained through a completely different non-rigorous 
approach, the replica method (in its replica- symmetric form). 



This is indeed an instance of a more general equivalence 
between replica and cavity methods. 

During the last few months, several papers investigated 
compressed sensing problems using the replica method |[T7ll . 
fTH . fl9l . In view of the discussion above, it is not surprising 
that these results can be recovered from the state evolution 
formalism put forward in (TJ. Let us mention that the latter 
has several advantages over the replica method: (1) It is more 
concrete, and its assumptions can be checked quantitatively 
through simulations; (2) It is intimately related to efficient 
message passing algorithms; (3) It actually allows to predict 
the performances of these algorithms. 
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